Algorithms for Within-Cluster Searches Using Inverted Files
Abstract
Information retrieval over clustered document collections proceeds in two successive stages: first identifying the best-clusters, and then identifying the best-documents within those clusters, i.e., the documents most similar to the user query. In this paper, we assume that an inverted file over the entire document collection is used for the latter stage. We propose and evaluate algorithms for within-cluster searches, i.e., algorithms that integrate the best-clusters with the best-documents so that the final output contains the highest-ranked documents drawn only from the best-clusters. Our experiments on a TREC collection of 210,158 documents with several query sets show that an integration algorithm selected appropriately for the query length and available system resources can significantly improve query evaluation efficiency.
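The abstract does not spell out the integration algorithms themselves, so the following is only a minimal sketch of one possible integration strategy, not the paper's method. It assumes that the best-clusters have already been chosen by some earlier stage, that each document belongs to exactly one cluster, and that raw term frequency is an acceptable stand-in for a real similarity measure. All names (`build_inverted_file`, `within_cluster_search`, `doc_cluster`) are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to a list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index


def within_cluster_search(query, index, doc_cluster, best_clusters, k):
    """Return the top-k documents, drawn only from the best clusters.

    Assumption: the score is a plain sum of term frequencies; a real
    system would use a weighted similarity measure instead.
    """
    allowed = set(best_clusters)
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, []):
            # Skip postings of documents outside the best clusters.
            if doc_cluster[doc_id] in allowed:
                scores[doc_id] += tf
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "inverted file search",
    "d2": "cluster search search",
    "d3": "inverted file cluster search",
    "d4": "optical disk storage",
}
doc_cluster = {"d1": 0, "d2": 1, "d3": 0, "d4": 2}

index = build_inverted_file(docs)
result = within_cluster_search("inverted search", index, doc_cluster, [0], k=2)
print(result)
```

In this sketch the cluster filter is applied while the postings lists are traversed, so documents outside the best-clusters never receive a score accumulator. An alternative would be to rank all matching documents first and filter afterwards; which choice is cheaper plausibly depends on query length and available memory, which is the kind of trade-off the abstract refers to.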
Similar resources
Using Additional Indexes for Fast Full-Text Search of Phrases That Contains Frequently Used Words
Searches for phrases and word sets in large text arrays by means of additional indexes are considered. Their use may reduce the query-processing time by an order of magnitude in comparison with standard inverted files.
Transaction / Regular Paper Title
Current high-throughput algorithms for constructing inverted files all follow the MapReduce framework, which presents a high-level programming model that hides the complexities of parallel programming. In this paper, we take an alternative approach and develop a novel strategy that exploits the current and emerging architectures of multicore processors. Our algorithm is based on a high-throughp...
ISIS: A New Approach for Efficient Similarity Search in Sparse Databases
High-dimensional sparse data is prevalent in many real-life applications. In this paper, we propose a novel index structure for accelerating similarity search in high-dimensional sparse databases, named ISIS, which stands for Indexing Sparse databases using Inverted fileS. ISIS clusters a dataset and converts the original high-dimensional space into a new space where each dimension represents a...
Distributed Query Processing Using Suffix Arrays
Suffix arrays are more efficient than inverted files for solving complex queries in a number of applications related to text databases. Examples arise when dealing with biological or musical data or with texts written in oriental languages, and when searching for phrases, approximate patterns and, in general, regular expressions involving separators. In this paper we propose algorithms for proc...
Fast Text Access Methods for Optical and Large Magnetic Disks: Design and Performance Comparison
High capacity disks, especially optical ones, are commercially available. These disks are ideal for archiving large text data bases. In this work, we examine efficient searching techniques for such applications. We propose a unifying framework, which reveals the similarities between signature files and an inverted file using a hash table. Then, we design methods that combine the ease of inserti...